perf: avoid O(N^2) exiting-branch checks in CodeFolding by Changqing-JING · Pull Request #8599 · WebAssembly/binaryen

Changqing-JING · 2026-04-14T03:38:13Z

Follow up PR of #8586 to optimize CodeFolding

optimizeTerminatingTails calls EffectAnalyzer per tail item, each walking the full subtree. On deeply nested blocks this is O(N^2).

Replace the per-item walks with a single O(N) bottom-up PostWalker (populateExitingBranchCache) that pre-computes exiting-branch results for every node, making subsequent lookups O(1).

Example: AssemblyScript GC compiles __visit_members as a br_table dispatch over all types, producing ~N nested blocks with ~N tails. The old code walks each tail's subtree separately -- O(N^2) total node visits. With this change, one bottom-up walk covers all nodes, then each tail lookup is O(1).

(block $A          ;; depth 4000
  (block $B        ;; depth 3999
    (block $C      ;; depth 3998
      ...
      (br_table $A $B $C ... (local.get $rtid))
    )
    (unreachable)  ;; tail at depth 3999, old code walks 3999 nodes
  )
  (unreachable)    ;; tail at depth 4000, old code walks 4000 nodes
)

benchmark data
The test module is from issue #7319
#7319 (comment)

In main head

time ./build/bin/wasm-opt -Oz --enable-bulk-memory --enable-multivalue --enable-reference-types --enable-gc --enable-tail-call --enable-exception-handling  -o /dev/null ./test3.wasm

real    9m16.111s
user    35m33.985s
sys     0m51.000s

In the PR

time ./build/bin/wasm-opt -Oz --enable-bulk-memory --enable-multivalue --enable-reference-types --enable-gc --enable-tail-call --enable-exception-handling  -o /dev/null ./test3.wasm

real    5m17.170s
user    30m9.198s
sys     0m28.030s

kripken · 2026-04-15T19:05:16Z

+  // efficient bottom-up traversal.
+  bool hasExitingBranches(Expression* expr) {
+    if (!exitingBranchCachePopulated_) {
+      populateExitingBranchCache(getFunction()->body);


Looks like this still scans the entire function. I suggest that we only scan expr itself. That will still avoid re-computing things, but avoid scanning things that we never need to look at.

This does require that the cache store a bool, so we know if we scanned or not, and if we did, if we found branches out or not. But I think that is worth it - usually we will scan very few things.

The per-expression cache would still be O(N^2) in the nested block case. AssemblyScript GC emits __visit_members with deeply nested blocks + br_table, where the nesting level equals the number of classes (4000+ in real apps). Each nested block gets queried by optimizeTerminatingTails, and each query walks its overlapping subtree independently, giving O(N + (N-1) + ... + 1) = O(N^2) total work even with the cache.

We also cannot reuse a child's cached bool to compute a parent's result, because knowing "child has exiting branches" does not tell us which names exit -- the parent may define/resolve some of them. To compose results bottom-up, we would need to store the full set of unresolved names per expression. I benchmarked that approach (storing unordered_map<Expression*, unordered_set> and propagating name sets upward), but the per-node set allocation overhead on millions of nodes made -Oz significantly slower than the baseline (~13min vs ~5min).

The whole-function scan avoids both issues by computing all results in a single O(N) pass using only integer counters, with no per-node name storage.

We also cannot reuse a child's cached bool to compute a parent's result [..] I benchmarked that approach (storing unordered_map<Expression*, unordered_set> and propagating name sets upward), but the per-node set allocation overhead on millions of nodes made -Oz significantly slower than the baseline (~13min vs ~5min).

What is the baseline here? (is it before this PR, or the PR's current state)

Current main costs 9min
Pr current stauts cost 5min
The per-node set allocation cost 13min

I see, thanks. Ok, it might really make sense to scan the whole function then, in a fast way, rather than less code in a slower way.

kripken

Looks good but I'll run some local fuzzing before landing.

kripken · 2026-04-17T00:03:21Z

Unfortunately I see opposite results locally. I tried the two Dart files linked here:

https://chromium-review.git.corp.google.com/c/emscripten-releases/+/7769309

I measured like this:

taskset -c 0-3 perf stat -r 10 bin/wasm-opt dart-complex.unopt.wasm -all --code-folding --code-folding

taskset makes sure to run on the performance cores on my machine, reducing noise. It then averages 10 runs of two runs of the pass (the second adds some measurement of what happens when the pass finds no work to do). Data for the larger file ("complex"):

Without the patch:

     6,979,952,729      cpu_core/instructions/u          #    1.71  insn per cycle              ( +-  0.00% )
     1,582,119,997      cpu_core/branches/u              #  624.871 M/sec                       ( +-  0.00% )

            1.2636 +- 0.0134 seconds time elapsed  ( +-  1.06% )

And with the patch:

     7,250,548,090      cpu_core/instructions/u          #    1.70  insn per cycle              ( +-  0.00% )
     1,639,876,729      cpu_core/branches/u              #  623.784 M/sec                       ( +-  0.00% )

           1.28402 +- 0.00922 seconds time elapsed  ( +-  0.72% )

The seconds elapsed regressed, though that might in theory be due to noise. The # of instructions and branches is extremely stable though, and they regress by 3-4%.

Perhaps you can take a look at the larger of those two Dart files and see if you get the same issue locally?

Changqing-JING · 2026-04-17T09:54:05Z

Thanks for the feedback!

My dart test result

I tried to run the dart case on my laptop,

I have run it for taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding for 5 times, the time cost is from 0.6516 to 0.7846.

So the dart test case seems too small to test this PR. It's very hard to measure 3% regression on it.

My research report

I reworked the approach to avoid the conservative childFromPriorWalkHasExiting hack and store precise name sets instead of just bools.

on-demand cache:
Changqing-JING@733935f)
I optimized this solution today, now at least it's better than main branch.

pre-fill whole-function scan

Instead of on-demand per-query walks, do a single whole-function walk upfront in doWalkFunction to populate the cache for all expressions at once.

Approach	test3.wasm (pathological deep nesting)
main (no cache)	7m33s
whole-function scan(this PR current status)	4m08s
On-demand cache	5m35s

Pre-fill is ~25% faster on the pathological case because it walks the tree exactly once with a single set of transient name sets, while on-demand creates and destroys name sets per query and copies cached name sets from prior walks into new ones. On the other hand, pre-fill pays an upfront cost for every function even when hasExitingBranches is never called, which can regress normal workloads (3-4% on dart). The on-demand version has zero cost when the cache isn't needed, at the expense of being slower on the worst case.

Now both the pre-fill version and on-demand version are better than main in test3.wasm case.

I've pushed the on-demand version. Would you prefer the the on-demand version instead, or is pre-fill approach acceptable?

Want me to adjust anything?

Attachments:

Run 1:

Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):

     1,247,751,877      task-clock                       #    1.915 CPUs utilized               ( +-  2.05% )
                60      context-switches                 #   48.086 /sec                        ( +-  3.11% )
                10      cpu-migrations                   #    8.014 /sec                        ( +-  7.34% )
            35,331      page-faults                      #   28.316 K/sec                       ( +-  0.01% )
   <not supported>      cycles                                                                

            0.6516 +- 0.0121 seconds time elapsed  ( +-  1.86% )

Run 2:

 taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
                     \___ Bad event or PMU

Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output

 Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):

     1,229,511,099      task-clock                       #    1.824 CPUs utilized               ( +-  2.42% )
                56      context-switches                 #   45.547 /sec                        ( +-  3.46% )
                10      cpu-migrations                   #    8.133 /sec                        ( +- 10.16% )
            35,328      page-faults                      #   28.733 K/sec                       ( +-  0.01% )
   <not supported>      cycles                                                                

            0.6740 +- 0.0122 seconds time elapsed  ( +-  1.81% )

Run3:

taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
                     \___ Bad event or PMU

Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output

 Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):

     1,368,949,315      task-clock                       #    1.745 CPUs utilized               ( +-  2.05% )
                58      context-switches                 #   42.368 /sec                        ( +-  2.98% )
                10      cpu-migrations                   #    7.305 /sec                        ( +- 14.05% )
            35,335      page-faults                      #   25.812 K/sec                       ( +-  0.00% )
   <not supported>      cycles                                                                

            0.7846 +- 0.0131 seconds time elapsed  ( +-  1.67% )

Run 4:

taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
                     \___ Bad event or PMU

Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output

 Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):

     1,175,147,864      task-clock                       #    1.776 CPUs utilized               ( +-  1.59% )
                56      context-switches                 #   47.654 /sec                        ( +-  3.36% )
                 8      cpu-migrations                   #    6.808 /sec                        ( +- 10.21% )
            35,333      page-faults                      #   30.067 K/sec                       ( +-  0.01% )
   <not supported>      cycles                                                                

           0.66184 +- 0.00673 seconds time elapsed  ( +-  1.02% )

Run 5:

taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
                     \___ Bad event or PMU

Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output
warning: no output file specified, not emitting output

 Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):

     1,327,088,464      task-clock                       #    1.842 CPUs utilized               ( +-  2.24% )
                57      context-switches                 #   42.951 /sec                        ( +-  3.34% )
                 9      cpu-migrations                   #    6.782 /sec                        ( +-  8.83% )
            35,332      page-faults                      #   26.624 K/sec                       ( +-  0.00% )
   <not supported>      cycles                                                                

            0.7204 +- 0.0122 seconds time elapsed  ( +-  1.70% )

kripken · 2026-04-20T17:11:48Z

It looks like you're on a machine where perf stat can't access hardware counters. That makes it harder to benchmark. I suggest

More iterations (perf stat -r 20 or even higher numbers)
Make sure taskset is right for your system. It should pin the performance cores, to avoid the slower cores adding noise, but it depends on your actual CPU.
A larger file. Here is a large Kotlin file, from the KotlinConf app: https://jetbrains.github.io/kotlinconf-app/73cbe24d7cf5a54d37ad.wasm

Here is what I see without this PR on that Kotlin file (with -r 10):

     8,371,439,033      cpu_core/instructions/u          #    1.84  insn per cycle              ( +-  0.00% )
     1,780,305,514      cpu_core/branches/u              #  633.982 M/sec                       ( +-  0.00% )
           1.33927 +- 0.00243 seconds time elapsed  ( +-  0.18% )

And with this PR:

     8,861,366,837      cpu_core/instructions/u          #    1.81  insn per cycle              ( +-  0.00% )
     1,886,380,376      cpu_core/branches/u              #  623.892 M/sec                       ( +-  0.00% )
            1.4126 +- 0.0114 seconds time elapsed  ( +-  0.81% )

The regression is significantly larger than the noise.

Changqing-JING · 2026-04-21T08:21:37Z

@kripken
I have updated this PR to On-demand cache solution.
This solution is slower than the previous one on test3.wasm, but can avoid regression on the dart and kotlin case, could you help to review this?

kripken · 2026-04-22T19:45:44Z

I don't see any code pushed since my comment here which measured a Kotlin regression:

#8599 (comment)

Which commit addresses that comment?

Changqing-JING

@kripken Thank you for reminder.
It's strange, in my vscode I confirmed the commit was pushed. I also can see it in the "Files Changed"
I also awared that github web gui didn't show this activity, maybe some github bug here.

To avoid confusion, I just pushed this commit with another id again. Now you can see the commit 94ae0be.

Co-authored-by: Copilot <copilot@github.com>

kripken · 2026-04-30T16:05:53Z

+  // transiently (moved from children, erased after merge). Only the root's
+  // name set is persisted. Already-cached subtrees are skipped via scan(),
+  // and their cached names are merged in precisely.
+  bool populateExitingBranchCache(Expression* root) {


The returning of bool here is a little odd and non-obvious, I think (it returns if the root has exiting branches, not whether we populated the cache or anything like that). How about removing the result, and from hasExitingBranches(), populating if necessary and then reading the cache?

Another option might be for this to return a const std::unordered_set<Name>& for the root. That would be unambiguous, together with a comment that explains it is an optimization to avoid another lookup after?

Changqing-JING requested a review from a team as a code owner April 14, 2026 03:38

Changqing-JING requested review from kripken and removed request for a team April 14, 2026 03:38

Changqing-JING marked this pull request as draft April 14, 2026 03:38

Changqing-JING mentioned this pull request Apr 14, 2026

wasm-opt -Oz takes an inordinate amount of time #7319

Open

avoid O(N^2) exiting-branch checks in CodeFolding

66dff99

Changqing-JING force-pushed the opt/compile-speed branch from 1dae3f3 to 66dff99 Compare April 14, 2026 04:27

Changqing-JING marked this pull request as ready for review April 14, 2026 04:59

Changqing-JING mentioned this pull request Apr 14, 2026

[NFC] cache repeated tree walks to avoid O(N^2) in optimizeTerminatingTails in CodeFolding #8602

Open

kripken reviewed Apr 14, 2026

View reviewed changes

Comment thread src/passes/CodeFolding.cpp Outdated

Fix review

f263f08

Changqing-JING force-pushed the opt/compile-speed branch from daf81f7 to f263f08 Compare April 15, 2026 04:41

Changqing-JING requested a review from kripken April 15, 2026 05:41

kripken reviewed Apr 15, 2026

View reviewed changes

Fix review

b90aee7

Changqing-JING requested a review from kripken April 16, 2026 01:58

kripken approved these changes Apr 16, 2026

View reviewed changes

Changqing-JING force-pushed the opt/compile-speed branch 3 times, most recently from 733935f to b90aee7 Compare April 17, 2026 09:32

Changqing-JING marked this pull request as draft April 17, 2026 09:37

Changqing-JING marked this pull request as ready for review April 17, 2026 09:54

Changqing-JING commented Apr 23, 2026

View reviewed changes

fix kotlin and dart regression

94ae0be

Changqing-JING force-pushed the opt/compile-speed branch from 733935f to 94ae0be Compare April 23, 2026 00:10

kripken reviewed Apr 28, 2026

View reviewed changes

Comment thread src/passes/CodeFolding.cpp Outdated

Fix review

061a332

Co-authored-by: Copilot <copilot@github.com>

kripken reviewed Apr 30, 2026

View reviewed changes

Conversation

Changqing-JING commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kripken Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

Changqing-JING Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

kripken Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Changqing-JING Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

kripken Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

kripken left a comment

Choose a reason for hiding this comment

Uh oh!

kripken commented Apr 17, 2026

Uh oh!

Changqing-JING commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

My dart test result

My research report

Attachments:

Uh oh!

kripken commented Apr 20, 2026

Uh oh!

Changqing-JING commented Apr 21, 2026

Uh oh!

kripken commented Apr 22, 2026

Uh oh!

Changqing-JING left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kripken Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

kripken Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Changqing-JING commented Apr 14, 2026 •

edited

Loading

Changqing-JING commented Apr 17, 2026 •

edited

Loading

Changqing-JING left a comment •

edited

Loading